On Minimizing Disk-read and Download for 
Storage-Node Recovery 

Nihar B. Shah 

Department of Electrical Engineering and Computer Sciences, 
University of California, Berkeley 
nihar@eecs.berkeley.edu






Abstract — We consider the problem of efficient recovery of the 
data stored in any individual node of a distributed storage system, 
from the rest of the nodes in the system. Applications include 
repair of failed storage nodes and degraded reads. We measure 
efficiency in terms of two metrics: the amount of download and 
the amount of disk-read that needs to be performed from the 
other nodes in the recovery process. To minimize the download, 
we focus on the minimum bandwidth setting of the 'regenerating 
codes' model for distributed storage. This model is associated
with two parameters $n$ and $d$: the system has a total of $n$ nodes,
and the data stored in any node must be (efficiently) recoverable
from any $d$ of the other $(n - 1)$ nodes. Lower bounds on the two
metrics under this model were derived previously; it has also
been shown that these bounds are achievable for the download
as well as disk-read when $d = n - 1$, and for the amount of
download when $d \neq n - 1$.

In this paper, we complete the picture by showing that when
$d \neq n - 1$, these lower bounds are strictly loose with respect to the
amount of disk-read required. The proof is information-theoretic,
and hence applies to non-linear codes as well. We also show that 
under two relaxations of the problem setting, these lower bounds 
can be met for both download and disk-read simultaneously: (a) 
if for disk-reads, these bounds must be met only for recovery 
of data of systematic nodes, and (b) if recovery with minimum 
disk-read may be performed from a specific set of d nodes (as 
opposed to any set of d). 



I. Introduction 

Consider a distributed storage system with $n$ storage nodes,
each of which has a storage capacity of $\alpha$ bits. Data of
size $B$ bits is to be stored across these nodes in a manner
that the entire data can be recovered from any k of the n 
nodes. A problem that has received considerable attention in 
the recent past is that of efficient recovery of the data stored 
in an individual node, from the data stored in the remaining 
nodes in the system. This arises during handling of failures 
in distributed storage systems: upon failure of a node, it is 
replaced by a new node that must (efficiently) recover the 
data stored previously in the failed node from the remaining 
nodes in the system. A second application is that of degraded 
reads: if a node is busy or temporarily unavailable, then any 
request for the data stored in that node is (quickly) served by 
downloading data from the remaining nodes. 

We measure the efficiency of the process of recovering the 
data stored in a node in terms of two metrics: the amount 
of data that must be read from the disks at the other nodes, 
and the amount of data downloaded from them. To optimize the
amount of download, we consider the minimum bandwidth
(MBR) setting of the regenerating codes model [1] for distributed
storage. Under this model, recovery of the data stored
in any individual node must be accomplished by connecting to
any $d$ $(< n)$ other nodes and downloading $\beta_D$ bits of data from
each of them. Furthermore, under this model, these parameters
must satisfy the condition

$$d \beta_D = \alpha \qquad (1)$$

An intuitive explanation of (1) is that the recovery of the data
stored in a node should entail only as much download as the
amount of data stored. Throughout the rest of this paper, we
shall assume that (1) is satisfied.

Under the MBR setting described above, a lower bound on
the amount of download was derived in [1] as

$$\beta_D \geq \frac{B}{kd - \binom{k}{2}} \qquad (2)$$

It is easy to see that the amount of data that is read from
a disk at a node is at least as much as the amount of data
downloaded from that node.¹ It follows that the total amount
of disk-read $\beta_R$ at any of the $d$ nodes helping in the recovery
must obey

$$\beta_R \geq \beta_D \qquad (3)$$

and hence is also lower bounded as

$$\beta_R \geq \frac{B}{kd - \binom{k}{2}} \qquad (4)$$

In this paper, we investigate the existence of codes that
satisfy the aforementioned lower bounds (2) and (4) with
equality, i.e., satisfy

$$\beta_D = \frac{B}{kd - \binom{k}{2}} \qquad (5)$$

$$\beta_R = \frac{B}{kd - \binom{k}{2}} \qquad (6)$$



for the recovery of the data of any of the $n$ nodes from any $d$
other nodes in the system. It has been shown previously [2],
[3] that when $d = n - 1$, the amount of download as well as
disk-read can simultaneously achieve (5) and (6) respectively
for the recovery of the data of any individual node. Also,
explicit codes with a download equalling (5) for all values
of the parameters were proposed previously in [4]. However,
it is unknown at present whether or not the lower bound on
the disk-reads (6) can also be matched along with that on the
download (5) when $d \neq n - 1$.

¹The download may be smaller than the disk-read, since the data passed
may be a (non-injective) function of the data that is read.
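As a numerical illustration of the quantities above, the following minimal Python sketch evaluates the common right-hand side of (5) and (6) together with the storage capacity fixed by condition (1). The function name `mbr_bounds` and the sample parameters are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import comb

def mbr_bounds(B, k, d):
    """Return (beta, alpha) for data size B and parameters k, d.

    beta  : the value B / (kd - C(k,2)) appearing in (5) and (6)
    alpha : the per-node storage d * beta implied by condition (1)
    """
    beta = Fraction(B, k * d - comb(k, 2))
    alpha = d * beta
    return beta, alpha

# Hypothetical example: B = 9 units of data, k = 3, d = 4.
beta, alpha = mbr_bounds(B=9, k=3, d=4)
print(beta, alpha)  # prints "1 4"
```

With these sample values each helper node would need to supply one unit of data, and each node stores four units, so a repair downloads exactly as much as the node stores.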

In this paper, we complete this picture by showing that
under the MBR setting described above, when $d \neq n - 1$, it is
impossible to construct codes that simultaneously satisfy (5)
and (6) for the download and disk-read respectively. This
result is obtained via an information-theoretic proof, and hence 
allows us to conclude that these bounds cannot be met even 
with non-linear codes. 

We also consider two relaxations of the problem setting, 
under which we provide explicit codes that can simultaneously 
achieve both (5) and (6) for all values of the system parameters.
Recall that under the setting described above, the data
of any individual node must be recoverable from any $d$ other
nodes, with the download and disk-reads satisfying (5) and (6)
respectively. The two relaxations respectively weaken the two
"any" criteria with respect to the disk-read. Under the first
relaxation, we require the disk-read to achieve (6) for only the
recovery of the data stored in the systematic nodes (recovery
of the data of the remaining nodes is allowed to have a larger
disk-read). This relaxed setting is relevant to the problem of
degraded reads, where typically, the data stored in (only) the
systematic nodes is of interest. Under the second relaxation,
for the recovery of the data of any node, we require that (6)
be achieved for disk-reads from one set of $d$ other nodes. The
codes presented for both these relaxations are obtained via
modifications of the 'product-matrix' codes of [4].

We now take a brief digression to discuss a related notion,
that of 'repair-by-transfer', which will be called upon
frequently in the subsequent sections. Observe that when (5)
and (6) are satisfied, the amount of download $\beta_D$ is equal to
the amount of disk-read $\beta_R$. As a result, whenever (5) and (6)
are met, each of the $d$ nodes helping in the recovery must
simply pass a part of the data that it stores, without performing
any computations. This manner of recovering the data stored
in a node is termed repair-by-transfer [3] (the term 'repair'
arises from the application of repair of failed nodes considered
in [3]). It follows that a repair-by-transfer code that satisfies (5)
for the amount of download automatically achieves (6) for the
disk-read as well. Thus the problem considered in this paper
can equivalently be stated as follows: for the MBR setting
described above, under what conditions is it possible to design
a code that can perform repair-by-transfer with a download
satisfying (5)?

The remainder of this paper is organized as follows.
Section II describes related literature. Section III presents
an information-theoretic proof showing the impossibility of
achieving the previously derived lower bounds on the amount
of disk-read and download. Section IV considers relaxations
of this setting, and provides explicit codes operating under 
these relaxations. 

II. Related Literature 

As described previously, explicit codes meeting (5) and (6)
for recovery of the data of any node are presented in [2], [3]
for the MBR setting when $d = n - 1$. The notion of
'repair-by-transfer' is also introduced in these papers. The
repair-by-transfer codes of [2], [3] were subsequently extended to a
more general but relaxed setting in [5]. In [5], the condition of
optimally recovering the data of an individual node from any
$d$ nodes is relaxed to doing so from specific subsets of $d$ nodes,
with respect to both metrics, the amount of disk-read and the
download. We note that in contrast, the relaxations presented
subsequently in this paper make this relaxation only for the
amount of disk-read, and the amount of download continues
to achieve (5) for every set of $d$ nodes.

In addition to the MBR setting discussed above, the
regenerating codes model of [1] has an alternative setting
associated with it: the minimum storage regeneration (MSR)
setting. Under the MSR setting, the storage is required to be an
absolute minimum, and for this value of storage, the amount of
download is then optimized. The problem of minimizing
disk-reads is studied for the MSR setting in [6]-[9]. In particular,
MSR codes performing repair-by-transfer with a minimum
download for the systematic nodes are constructed in [6]-[8].
A somewhat different setting called 'functional repair'
is considered in [9], and MSR codes performing functional
repair-by-transfer with minimum download for all nodes are
provided.

III. Impossibility Result 

This section presents the main result of the paper: when
$d < n - 1$, there cannot exist any code under which the data stored
in any node can be recovered from any $d$ other nodes while
satisfying (5) and (6). This result encompasses both linear and
non-linear codes. The proof of the theorem may be skipped
without any loss in continuity.

Theorem 1: Under the MBR setting, when $d < n - 1$, there
cannot exist any code that performs repair-by-transfer of any
node from any $d$ other nodes with a download satisfying (5).

Proof: The proof proceeds via a contradiction. Let us 
suppose there exists one such code for some system parameters 
with $d < n - 1$. The proof is divided into three parts. First,
it is shown that there exist (at least) three nodes that store 
(at least) one bit of data in common. Next, it is shown that 
for recovery of the data of any one of these nodes, the other 
two nodes must pass this bit. Finally, we show that under this 
condition, such an attempt of recovery must necessarily fail. 

Consider recovery of the data stored in nodes $\{1, \ldots, d+1\}$
(one at a time), from node $n$ and some other arbitrary $(d - 1)$
nodes. In each case, node $n$ passes a subset of $\beta$ bits out
of the $\alpha$ $(= d\beta)$ bits that it stores, where $\beta$ denotes the
common value of $\beta_D$ and $\beta_R$ under repair-by-transfer. We emphasize that due
to the requirement of repair-by-transfer, the bits passed are
simply a subset of those it stores (and do not arise from any
computations on the stored bits). In total, node $n$ passes
$(d+1)\beta$ bits out of the $d\beta$ bits it stores. It follows from the
pigeonhole principle that there exists at least one bit that occurs at
least twice in this set of $(d+1)\beta$ bits. Furthermore, it is shown
in [3, Property 3] that the $\beta$ bits passed by a node, for recovery
of the data of any other node, must all be distinct. It follows
that there must exist at least two nodes out of $\{1, \ldots, d+1\}$
for which node $n$ passes the same bit. Let us assume that these
two nodes are nodes 1 and 2, and let $b$ denote this common
bit.
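The counting step above can be checked exhaustively for very small parameters. The following Python sketch is only an illustration under stated assumptions: the stored bits of node $n$ are modelled as indices $0, \ldots, d\beta - 1$, each repair uses $\beta$ distinct indices, and the function name `every_assignment_repeats` is hypothetical.

```python
from itertools import combinations, product

def every_assignment_repeats(d, beta):
    """Check that any choice of beta-subsets for d+1 repairs reuses a stored bit."""
    stored = range(d * beta)                     # indices of the d*beta bits held by node n
    subsets = list(combinations(stored, beta))   # beta distinct bits passed per repair
    for choice in product(subsets, repeat=d + 1):  # one subset per node in {1, ..., d+1}
        counts = {}
        for subset in choice:
            for bit in subset:
                counts[bit] = counts.get(bit, 0) + 1
        if max(counts.values()) < 2:
            return False   # no bit reused: would contradict the pigeonhole argument
    return True

print(every_assignment_repeats(d=2, beta=1))  # True
print(every_assignment_repeats(d=2, beta=2))  # True
```

Since the bits within a single repair are distinct, a bit that is counted twice is necessarily passed for the repair of two different nodes, which is the pair labelled nodes 1 and 2 in the proof.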
For $i = 1, \ldots, n$, let $W_i$ be a random variable corresponding
to the data stored in node $i$. It is shown in [3, Property 1]
that $H(W_i) = \alpha$, where $H(\cdot)$ denotes the Shannon entropy.
Furthermore, the quantity $W_i$ is completely determined by
the data passed by the $d$ nodes. Thus, for recovery of the data
of any node $i$ from some $d$ other nodes, it must be that the
entropy of the $d\beta$ $(= \alpha)$ bits passed by the $d$ nodes is $\alpha$. As
a special case it follows that $H(b) = 1$. It also follows that
the bit $b$ must be stored in the replacement node, and thus
$H(b \mid W_1) = 0$ and $H(b \mid W_2) = 0$. Moreover, since bit $b$ was
originally stored in node $n$, $H(b \mid W_n) = 0$. Thus, the bit $b$ is
stored in nodes 1, 2 and $n$, and $H(b) = 1$.



Now consider recovering the data of node $n$ from nodes
$\{1, \ldots, d\}$. We shall now show that nodes 1 and 2 must both
pass bit $b$ to node $n$. Let $S_1$ and $S_2$ be random variables
corresponding to data passed by nodes 1 and 2 respectively to
node $n$. It is shown in [3, Property 3] that $H(S_1) = H(S_2) = \beta$,
and in [3, Property 2] that $I(W_n; W_1) = I(W_n; W_2) = \beta$,
where $I(\cdot\,;\cdot)$ denotes the mutual information. Furthermore,
since $S_1$ and $S_2$ are passed by nodes 1 and 2, it must be that
$H(S_1 \mid W_1) = H(S_2 \mid W_2) = 0$. Thus,

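$$
\begin{aligned}
2\beta &= I(W_n; W_1) + I(W_n; W_2) && (7) \\
&\geq I(W_n; b, S_1) + I(W_n; b, S_2) && (8) \\
&= I(W_n; S_1) + H(b \mid S_1) - H(b \mid W_n, S_1) \\
&\quad + I(W_n; S_2) + H(b \mid S_2) - H(b \mid W_n, S_2) && (9) \\
&= I(W_n; S_1) + H(b \mid S_1) + I(W_n; S_2) + H(b \mid S_2) && (10) \\
&= H(W_n) - H(W_n \mid S_1) + H(b \mid S_1) \\
&\quad + H(W_n) - H(W_n \mid S_2) + H(b \mid S_2) && (11) \\
&= d\beta - H(W_n \mid S_1) + H(b \mid S_1) \\
&\quad + d\beta - H(W_n \mid S_2) + H(b \mid S_2) && (12) \\
&\geq d\beta - (d-1)\beta + H(b \mid S_1) \\
&\quad + d\beta - (d-1)\beta + H(b \mid S_2) && (13) \\
&= 2\beta + H(b \mid S_1) + H(b \mid S_2), && (14)
\end{aligned}
$$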

where (13) follows from [3, Lemma 3], and the remaining
equations follow from the other properties discussed previously
in this proof. Thus,

$$H(b \mid S_1) = 0 \qquad (15)$$

$$H(b \mid S_2) = 0. \qquad (16)$$

It follows that 

$$
\begin{aligned}
2\beta &= d\beta - (d-2)\beta && (17) \\
&= H(W_n) - (d-2)\beta && (18) \\
&\leq H(W_n) - H(W_n \mid S_1, S_2) && (19) \\
&= I(W_n; S_1, S_2) && (20) \\
&\leq H(S_1, S_2) && (21) \\
&\leq H(S_1, S_2, b) && (22) \\
&= H(S_2) + H(b \mid S_2) + H(S_1 \mid S_2, b) && (23) \\
&= \beta + 0 + H(S_1 \mid S_2, b) && (24) \\
&\leq \beta + H(S_1 \mid b) && (25) \\
&= \beta + H(b \mid S_1) + H(S_1) - H(b) && (26) \\
&= 2\beta - 1 && (27)
\end{aligned}
$$

where (19) follows from [3, Lemma 3], and the remaining
equations follow from the other properties discussed above.
Clearly, (27) yields a contradiction. ■

IV. Explicit Codes for Two Relaxations 

Performing repair-by-transfer under the regenerating codes
setup described above amounts to being able to (efficiently)
recover the contents of any failed node from any $d$ of the
remaining nodes. We saw in the previous section that the bounds (5)
and (6) cannot be achieved simultaneously when $d \neq n - 1$.
Thus in this section, we consider two relaxations to this
setup, which shall allow us to achieve these bounds. The two
relaxations are obtained by slackening the two instances of
the quantifier "any" in the above description, for the amount
of disk-read. Note that under both the relaxations, we shall
continue to impose the requirements of the MBR setting, i.e.,
of being able to recover the entire data from any $k$ nodes,
and of satisfying (1) and (5) on the amount of download for
recovery of the data of any node from any $d$ nodes.

A. Optimal repair-by-transfer for systematic nodes 

A systematic code is defined as a code under which some k 
out of the n nodes store data in raw (uncoded) form. These k 
nodes are called the systematic nodes, while the other $(n - k)$
nodes are termed parity nodes. For several applications such as 
degraded reads, efficient recovery of the data in a systematic 
node is of greater importance than that of a parity node. 
Keeping this in mind, we relax the setting described above 
to the following requirements: 

• one should be able to recover the data stored in any node
from any $d$ other nodes with a download equal to (5);

• one should be able to recover the data stored in any
systematic node, from any $d$ other nodes, with a disk-read
and download equal to (6) and (5) respectively.
In other words, the requirement of repair-by-transfer is relaxed 
to hold only when recovering the data of a systematic node. 

We now provide an explicit code that achieves the conditions
listed above. This code is a modification of the product-matrix
MBR code of [4].² The code is linear, and operates
over some finite field $\mathbb{F}_q$ of size $q$ $(> n)$. As in [4], we present
constructions for the case when $\beta_D = 1$ symbol over $\mathbb{F}_q$;
constructions for a general value of $\beta_D$ can be obtained via
multiple concatenations of the $\beta_D = 1$ code (see [4, Section I-C]).
When $\beta_D = 1$, the storage capacity (1) reduces to $\alpha = d$
symbols over $\mathbb{F}_q$.

We first present a brief overview of the construction of a
product-matrix MBR code as in [4]. Let us denote this code
by $C$. The product-matrix MBR code is designed to satisfy (5),
i.e., when $\beta_D = 1$ symbol over $\mathbb{F}_q$, it operates on a data of

$$B = kd - \binom{k}{2}$$

symbols over $\mathbb{F}_q$. Under the encoding mechanism of the code,
this data is arranged in a specific manner into a $(d \times d)$
matrix $M$. Each node $i \in \{1, \ldots, n\}$ is associated with a
$d$-length vector $\psi_i$. The matrix $M$ and the vectors $\{\psi_i\}_{i=1}^{n}$ are
chosen to satisfy certain conditions [4, Section IV]. One of
these conditions, which we shall exploit here, is that any $d$
of the $n$ vectors $\{\psi_i\}_{i=1}^{n}$ are linearly independent. Under the
product-matrix MBR code, every node $i \in \{1, \ldots, n\}$ stores
the $\alpha$ $(= d)$ symbols

$$\psi_i^T M.$$

We also assume that the code is systematic [4, Section IV-B]
with nodes $\{1, \ldots, k\}$ being the systematic nodes.

²While we discuss only the MBR case in this paper, the ideas presented
here are also applicable to the product-matrix MSR codes of [4].

It is shown in [4, Theorem 3] that the entire data $M$ can be
recovered from the data stored in any $k$ of the $n$ nodes. Let
us now look at recovery of the data stored in an individual
node $i \in \{1, \ldots, n\}$ from some $d$ nodes $\{j_1, \ldots, j_d\}$. Under
the protocol proposed in [4], each of these $d$ nodes computes
the inner product of the $d$ symbols stored in it with the
$d$-length vector $\psi_i$ and passes the resulting symbol. Thus,
the aggregate data passed is $\{\psi_{j_1}^T M \psi_i, \ldots, \psi_{j_d}^T M \psi_i\}$. It is
shown in [4, Theorem 2] that from this, the desired data $\psi_i^T M$
can be obtained. Observe that the amount of download is equal
to $d$ symbols over $\mathbb{F}_q$, and hence the code achieves (5).
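The recovery protocol just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than the construction of [4] in full: the field is the prime field with 13 elements, the vectors $\psi_i$ are taken to be Vandermonde rows (an assumed choice that makes any $d$ of them linearly independent), and the decoding step relies on the message matrix $M$ being symmetric, as it is in the product-matrix MBR construction. All function names are hypothetical.

```python
P = 13  # small prime field, chosen only for illustration

def psi(i, d):
    """Assumed encoding vector for node i: the Vandermonde row (1, i, i^2, ...)."""
    return [pow(i, e, P) for e in range(d)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % P

def vec_mat(v, M):
    """Row vector times matrix over the field."""
    return [sum(v[r] * M[r][c] for r in range(len(v))) % P for c in range(len(M[0]))]

def solve(A, b):
    """Solve A x = b over the field by Gauss-Jordan elimination (A assumed invertible)."""
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] % P)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)          # modular inverse (Python 3.8+)
        aug[col] = [x * inv % P for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(x - f * y) % P for x, y in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def repair(i, helpers, stored, d):
    """Recover node i's symbols from one computed symbol per helper."""
    target = psi(i, d)
    # Each helper j reads all d of its symbols and passes psi_j^T M psi_i.
    passed = [dot(stored[j], target) for j in helpers]
    rows = [psi(j, d) for j in helpers]
    # Solving gives M psi_i, which equals (psi_i^T M)^T because M is symmetric.
    return solve(rows, passed)

n, k, d = 6, 2, 3
M = [[1, 2, 4],   # symmetric 3 x 3 message matrix holding
     [2, 3, 5],   # B = kd - C(k,2) = 5 data symbols
     [4, 5, 0]]
stored = {i: vec_mat(psi(i, d), M) for i in range(1, n + 1)}
print(repair(1, [2, 4, 6], stored, d) == stored[1])  # True
```

Note that in this sketch every helper touches all $d$ of its stored symbols in order to pass a single one, which is exactly the gap between disk-read and download that the modified codes below aim to close.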

We shall now modify the code $C$ described above to obtain
a new code $C_1$ that, in addition, also minimizes the disk-reads
during recovery of the data stored in any systematic node.
Define a $(d \times d)$ matrix as

$$\Psi_0 = [\psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_d]. \qquad (28)$$

Under the code $C_1$, each node $i \in \{1, \ldots, n\}$ stores the $\alpha$ $(= d)$
symbols $\psi_i^T M \Psi_0$ (as opposed to storing $\psi_i^T M$ in $C$).

Let us now verify that the modified code $C_1$ meets all the
requirements. First, observe that the $(d \times d)$ matrix $\Psi_0$ is
invertible. Thus, the data stored in any node under code $C_1$
is equivalent [4, Appendix B] to that stored under code $C$.
This immediately results in the fulfilment of the conditions of
recovery of the entire data from any $k$ nodes, and recovery of
the data stored in any node from any $d$ nodes with a minimum
download. Next, consider recovering the data stored in any
systematic node $i \in \{1, \ldots, k\}$ from any $d$ nodes $\{j_1, \ldots, j_d\}$.
Under code $C_1$, each of these $d$ nodes simply reads and passes
the $i$th symbol it stores. Since the $i$th column of $\Psi_0$ is equal to
$\psi_i$, every node $j_\ell$ $(\ell \in \{1, \ldots, d\})$ thus reads and passes the
single symbol $\psi_{j_\ell}^T M \psi_i$. The data thus obtained is identical
to that obtained under code $C$, thereby ensuring successful
recovery of the data of the $i$th systematic node. The amount
of disk-read and download is exactly $d$, meeting (6) and (5)
respectively.
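The index bookkeeping behind $C_1$ can be illustrated with a short Python sketch. It is a toy under stated assumptions: the field has 13 elements, the vectors $\psi_i$ are assumed to be Vandermonde rows, and the systematic structure of [4, Section IV-B] is not reproduced, so only the read-one-symbol property is demonstrated. All names are hypothetical.

```python
P = 13  # small prime field, chosen only for illustration

def psi(i, d):
    """Assumed encoding vector for node i: the Vandermonde row (1, i, i^2, ...)."""
    return [pow(i, e, P) for e in range(d)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % P

def store_c(i, M, d):
    """Symbols stored by node i under C: psi_i^T M."""
    return [dot(psi(i, d), [M[r][c] for r in range(d)]) for c in range(d)]

def store_c1(i, M, d):
    """Symbols stored by node i under C_1: psi_i^T M Psi_0, Psi_0 = [psi_1 ... psi_d]."""
    base = store_c(i, M, d)
    return [dot(base, psi(col, d)) for col in range(1, d + 1)]

n, k, d = 6, 2, 3
M = [[1, 2, 4],
     [2, 3, 5],
     [4, 5, 0]]

def read_under_c1(helper, failed):
    """Under C_1 the helper reads a single stored symbol, the one at position `failed`."""
    return store_c1(helper, M, d)[failed - 1]

def computed_under_c(helper, failed):
    """Under C the helper reads all d symbols and computes psi_j^T M psi_i."""
    return dot(store_c(helper, M, d), psi(failed, d))

# For every node i in {1, ..., k} and every helper j, the two symbols agree.
print(all(read_under_c1(j, i) == computed_under_c(j, i)
          for i in range(1, k + 1)
          for j in range(1, n + 1) if j != i))  # True
```

The helper's work thus shrinks from $d$ reads plus an inner product to a single read, while the symbol delivered to the replacement node is unchanged.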

B. Optimal repair-by-transfer from d specific nodes 

In certain applications, the flexibility of minimizing the 
disk-read from any set of d nodes may be an overkill. This 
motivates the next relaxation, under which the following 
constraints must be satisfied: 

• one should be able to recover the data stored in any node
from any $d$ other nodes with a download equal to (5);

• for any node, there must exist at least one set of $d$ other
nodes such that recovery from these $d$ nodes entails a
disk-read and download equal to (6) and (5) respectively.
In other words, for recovery of the data stored in a node, the 
requirement of repair-by-transfer is relaxed to hold only for 
any one subset of d nodes. 

We now modify the product-matrix MBR code $C$ described
above to obtain a code $C_2$ that satisfies these conditions. To
simplify notation, define an operator $\oplus$ that operates on two
values in the set $\{1, \ldots, n\}$ and computes a sum that cycles
in the set $\{1, \ldots, n\}$, i.e., for any $x, y \in \{1, \ldots, n\}$,
$x \oplus y := 1 + ((x - 1 + y) \bmod n)$. Similarly, let operator $\ominus$ represent a
difference that cycles in the set $\{1, \ldots, n\}$, i.e.,
$x \ominus y := 1 + ((x - 1 - y) \bmod n)$. Under code $C_2$, each node $i \in \{1, \ldots, n\}$
stores the $\alpha$ symbols

$$\psi_i^T M \, [\psi_{i \oplus 1} \;\; \psi_{i \oplus 2} \;\; \cdots \;\; \psi_{i \oplus d}]$$

over $\mathbb{F}_q$. Since any $d$ vectors of the set $\{\psi_1, \ldots, \psi_n\}$ are
linearly independent, the matrix $[\psi_{i \oplus 1} \; \cdots \; \psi_{i \oplus d}]$ is
invertible for every $i$. Thus the data stored by a node under
code $C_2$ is equivalent [4, Appendix B] to that stored under code
$C$. This results in the fulfilment of the properties of recovery of
the entire data from any $k$ nodes and recovery of data of any
individual node from any $d$ nodes with a minimum download.
Under code $C_2$, in order to recover the data stored in any node
$i$ with a disk-read and download equal to (6) and (5), the set
of $d$ nodes $(i \ominus 1), \ldots, (i \ominus d)$ is queried. Each of these nodes
$\ell \in \{(i \ominus d), \ldots, (i \ominus 1)\}$ can simply transfer the single symbol
$\psi_\ell^T M \psi_i$ without any computation, since $\ell = i \ominus j$ implies
$\ell \oplus j = i$, so that this symbol is the $j$th one stored in node $\ell$.
The data thus obtained is
identical to that obtained under $C$, thus allowing for successful
recovery of the desired data. This meets the requisite bounds
on the disk-reads and download.
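The cyclic indexing used by $C_2$ can be checked with a small Python sketch. As before, this is a toy under stated assumptions: a field with 13 elements, Vandermonde rows assumed for the vectors $\psi_i$, and hypothetical function names (`oplus`, `ominus`, `store_c2`, `helpers_for`).

```python
P = 13  # small prime field, chosen only for illustration

def psi(i, d):
    """Assumed encoding vector for node i: the Vandermonde row (1, i, i^2, ...)."""
    return [pow(i, e, P) for e in range(d)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % P

def oplus(x, y, n):
    """Cyclic sum on {1, ..., n}."""
    return 1 + ((x - 1 + y) % n)

def ominus(x, y, n):
    """Cyclic difference on {1, ..., n}."""
    return 1 + ((x - 1 - y) % n)

def store_c(i, M, d):
    """Symbols stored by node i under C: psi_i^T M."""
    return [dot(psi(i, d), [M[r][c] for r in range(d)]) for c in range(d)]

def store_c2(i, M, d, n):
    """Symbols stored by node i under C_2: psi_i^T M [psi_{i+1} ... psi_{i+d}] (cyclic)."""
    base = store_c(i, M, d)
    return [dot(base, psi(oplus(i, j, n), d)) for j in range(1, d + 1)]

def helpers_for(i, d, n):
    """The designated helper set (i - 1), ..., (i - d), taken cyclically."""
    return [ominus(i, j, n) for j in range(1, d + 1)]

n, k, d = 6, 2, 3
M = [[1, 2, 4],
     [2, 3, 5],
     [4, 5, 0]]

ok = True
for i in range(1, n + 1):
    for j, helper in enumerate(helpers_for(i, d, n), start=1):
        read = store_c2(helper, M, d, n)[j - 1]             # single symbol read from disk
        expected = dot(store_c(helper, M, d), psi(i, d))    # psi_helper^T M psi_i under C
        ok = ok and (read == expected)
print(helpers_for(1, d, n), ok)  # [6, 5, 4] True
```

The helper at cyclic distance $j$ behind the failed node always reads its $j$th stored symbol, so the position to read depends only on the distance between the two nodes.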